Conversation

@ishandhanani (Contributor) commented Jun 24, 2025

This PR is a revamp of #1565 based on this comment by @rmccorm4.

Description

The OpenAI Completions API supports the following prompt inputs:

pub enum Prompt {
    String(String),
    StringArray(Vec<String>),
    // Minimum value is 0, maximum value is 50256 (inclusive).
    IntegerArray(Vec<u16>),
    ArrayOfIntegerArray(Vec<Vec<u16>>),
}

This PR adds batch-style support for StringArray and ArrayOfIntegerArray, and extends the same support that String (the default) already has to IntegerArray.

The sglang_inc.py engine has been updated to demonstrate this.

Approach

To minimize any performance hit, I first match on the input type. If we see a token-style input, we cast the IDs to u32 and construct the request directly. If not, we move forward with tokenization as expected.
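A minimal sketch of that dispatch, assuming the Prompt enum above; the TokenInput enum mirrors the one described in the review summary further down, but the helper name and body here are illustrative rather than the PR's exact code:

enum TokenInput {
    Single(Vec<u32>),
    Batch(Vec<Vec<u32>>),
}

// Token-style prompts skip tokenization entirely; u16 IDs are widened to u32.
fn try_extract_tokens(prompt: &Prompt) -> Option<TokenInput> {
    match prompt {
        Prompt::IntegerArray(ids) => Some(TokenInput::Single(
            ids.iter().map(|&id| u32::from(id)).collect(),
        )),
        Prompt::ArrayOfIntegerArray(batches) => Some(TokenInput::Batch(
            batches
                .iter()
                .map(|ids| ids.iter().map(|&id| u32::from(id)).collect())
                .collect(),
        )),
        // String and StringArray fall through to the usual tokenization path.
        _ => None,
    }
}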

Tests with all 4 input types

Script to tokenize text
from transformers import AutoTokenizer

# 1. Choose the model ID
model_name = "Qwen/Qwen2.5-7B"

# 2. Load the tokenizer
#    (the Qwen2.5 tokenizer is bundled in the model repo on HF)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3. Your text
text = "A large language model is a"

# 4. Encode to get token IDs (omit special tokens if you just want raw subwords)
encoding = tokenizer(text, add_special_tokens=False)

# 5. Inspect the results
print("Token IDs: ", encoding["input_ids"])

Using the following for testing:

  • Token IDs: [32, 3460, 4128, 1614, 374, 264]
  • Text: A large language model is a
  • Model: Qwen/Qwen2.5-7B

IntegerArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": [32, 3460, 4128, 1614, 374, 264],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-3f3f10ca-320d-4740-b4c6-d044436a0655",
  "choices": [
    {
      "text": " machine learning tool conceived by Google last August.\n\nImagine putting Strings into an ocean, taking a dip and then catching one for a meal.\n\n\nReth",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750819021,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 29,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

ArrayOfIntegerArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": [[32, 3460, 4128, 1614, 374, 264], [32, 3460, 4128, 1614, 374, 264]],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-50006ebd-b759-41fc-9d5f-335651b1910d",
  "choices": [
    {
      "text": " type of artificial intelligence designed to carry out human-like, one-sided conversations. Influenced by developments in the NLP/NLU “Big Bang”,",
      "index": 0,
      "finish_reason": null
    },
    {
      "text": " computer model. But that a shorthand. Apparently confusing. A large new language model is both big and notoriously vague. But stay tuned.— Casey C",
      "index": 1,
      "finish_reason": null
    }
  ],
  "created": 1750819161,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 58,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

String (with "nvext": {"use_raw_prompt": true})

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": "A large language model is a",
    "stream": false,
    "max_tokens": 30,
    "nvext": {"use_raw_prompt":true}
  }'
output

{
  "id": "cmpl-c2f7ef91-dff9-4a19-b3fa-75b41779f5f4",
  "choices": [
    {
      "text": " complex mathematical system that can be used to solve a wide variety of problems. The model consists of a sequence of tasks, each of which is solved",
      "index": 0,
      "finish_reason": null
    }
  ],
  "created": 1750819802,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 6,
    "completion_tokens": 29,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

StringArray

input

curl localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B",
    "prompt": ["A large language model is a", "A large language model is a"],
    "stream": false,
    "max_tokens": 30
  }'

output

{
  "id": "cmpl-a51a4d40-9e82-4aec-80c8-140073ecd0fd",
  "choices": [
    {
      "text": " model trained on a diverse dataset consisting of text, images, audio, and natural language processing data. These models allow developers to take feedback from humans",
      "index": 0,
      "finish_reason": null
    },
    {
      "text": " math-informed system which can be used for predicting choices along with processing output for respective ones. Summarizing Deep Learning Books Artificial Intelligence is considered",
      "index": 1,
      "finish_reason": null
    }
  ],
  "created": 1750820088,
  "model": "Qwen/Qwen2.5-7B",
  "object": "text_completion",
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 58,
    "total_tokens": 0,
    "prompt_tokens_details": null,
    "completion_tokens_details": null
  }
}

copy-pr-bot bot commented Jun 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot commented Jun 24, 2025

Walkthrough

This update refactors prompt preprocessing to explicitly handle both text and tokenized prompt inputs, adds support for batch token IDs, and extends trait and request implementations to distinguish and extract token-based inputs. Additionally, it modifies SSE event filtering for LLM metric annotations to depend on the DYN_RICH_EVENT_STREAM environment variable.
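As a rough sketch of that gating, assuming DYN_RICH_EVENT_STREAM is read as a boolean flag (the variable name comes from the summary; the accepted values and placement are assumptions, not the PR's exact check):

fn rich_event_stream_enabled() -> bool {
    // Treating "1" or "true" as enabled is illustrative only.
    std::env::var("DYN_RICH_EVENT_STREAM")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}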

Changes

File(s) and change summary:

  • lib/llm/src/http/service/openai.rs: SSE event filtering for LLM metric annotations now depends on the DYN_RICH_EVENT_STREAM environment variable.
  • lib/llm/src/preprocessor.rs: Refactored preprocessing logic to explicitly handle text vs. tokenized prompt inputs and set token IDs accordingly.
  • lib/llm/src/preprocessor/prompt.rs: Added TokenInput and PromptInput enums; extended the OAIChatLikeRequest trait with input-type and token-extraction methods.
  • lib/llm/src/preprocessor/prompt/template/oai.rs: Implemented the new trait methods for NvCreateCompletionRequest to classify and extract token inputs from prompts.
  • lib/llm/src/protocols/common/preprocessor.rs: Added an optional batch_token_ids field to the PreprocessedRequest struct for batch token input support (a sketch follows this list).
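A plausible shape for that batch_token_ids addition, using the field and type names from the summary above; the surrounding fields and exact types are assumptions, not the PR's code:

pub struct PreprocessedRequest {
    /// Token IDs for a single (non-batch) prompt.
    pub token_ids: Vec<u32>,
    /// New: per-prompt token IDs when the request carries a batch.
    pub batch_token_ids: Option<Vec<Vec<u32>>>,
    // ...other fields elided
}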

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant OpenAIPreprocessor
    participant Request
    participant PreprocessedRequestBuilder

    Client->>OpenAIPreprocessor: preprocess_request(Request)
    OpenAIPreprocessor->>Request: prompt_input_type()
    alt Prompt is Tokens
        OpenAIPreprocessor->>Request: extract_tokens()
        OpenAIPreprocessor->>PreprocessedRequestBuilder: set token_ids or batch_token_ids
    else Prompt is Text
        OpenAIPreprocessor->>Request: get raw or formatted prompt
        OpenAIPreprocessor->>OpenAIPreprocessor: tokenize prompt
        OpenAIPreprocessor->>PreprocessedRequestBuilder: set token_ids
    end
    OpenAIPreprocessor->>PreprocessedRequestBuilder: set sampling_options, annotations
    PreprocessedRequestBuilder->>Client: PreprocessedRequest

Poem

In fields of code where prompts may roam,
Now tokens march or text may comb.
Batch or single, all inputs shine,
With metrics streaming down the line.
A bunny hops through preprocess land—
Richer prompts now close at hand!
🐇✨



@ishandhanani (Contributor, Author) commented

There are some Dockerfile changes here that I'm using to test. I'll remove them before merging this PR; they belong in #1583.

@paulhendricks (Member) left a comment

It might be nice to pull out some of the internals of OpenAIPreprocessor so we can add test coverage in the module for different edge cases, e.g. "", ["", ""], [], [[]], instead of relying on the e2e curl scripts.

Overall approving, looks good!

@ishandhanani enabled auto-merge (squash) June 25, 2025 19:43